In the first exercise we installed the necessary software and connected to the online environment. I had trouble getting RStudio and Git to talk to each other, so I had to repeat the installation several times. After these problems I got to practice in the DataCamp environment, but for some reason the platform renders noticeably smaller than normal, and I have not yet solved this.
The R exercises covered very basic R usage, but at the very end they dove quickly into functions. In my opinion a few more video lectures on the topic would have been helpful. In any case, it is a very interesting concept. I have not used Git or Markdown before, so it is good to gain experience with these as well.
# Importing the data
data <- read.csv("learing_2019.csv")
#Summary
summary(data)
## X gender Age Attitude Points
## Min. : 1.00 F:110 Min. :17.00 Min. :14.00 Min. : 7.00
## 1st Qu.: 42.25 M: 56 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:19.00
## Median : 83.50 Median :22.00 Median :32.00 Median :23.00
## Mean : 83.50 Mean :25.51 Mean :31.43 Mean :22.72
## 3rd Qu.:124.75 3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:27.75
## Max. :166.00 Max. :55.00 Max. :50.00 Max. :33.00
## Deep Stra Surf
## Min. :1.583 Min. :1.250 Min. :1.583
## 1st Qu.:3.333 1st Qu.:2.625 1st Qu.:2.417
## Median :3.667 Median :3.188 Median :2.833
## Mean :3.680 Mean :3.121 Mean :2.787
## 3rd Qu.:4.083 3rd Qu.:3.625 3rd Qu.:3.167
## Max. :4.917 Max. :5.000 Max. :4.333
#Access ggplot and GGally
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
#Creating graphics
p1 <- ggplot(data, aes(x = Attitude, y = Points, col = gender))
#and fitting linear model
p2 <- p1 + geom_point() + geom_smooth(method = "lm")
p2
In these graphs we can see that gender plays a role. The biggest difference between the genders appears in the Attitude variable, where females have a clearly lower mean. The distributions of the Attitude and Surf variables also differ between males and females.
#Correlations
p <- ggpairs(data, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p
## Regression model with multiple explanatory variables
my_model <- lm(Points ~ Attitude + Stra + Surf, data = data)
# Summary of the model
summary(my_model)
##
## Call:
## lm(formula = Points ~ Attitude + Stra + Surf, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.01711 3.68375 2.991 0.00322 **
## Attitude 0.33952 0.05741 5.913 1.93e-08 ***
## Stra 0.85313 0.54159 1.575 0.11716
## Surf -0.58607 0.80138 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
Only Attitude appears significant in this fitted model. The multiple R-squared is about 0.21, meaning that roughly 20% of the variation in Points is explained by the model; since Attitude is the only significant predictor, this is close to the squared correlation between Attitude and Points.
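As a quick check of that R-squared interpretation, here is a small simulated example (not the course data) showing that for a single-predictor linear model, the multiple R-squared equals the squared correlation between predictor and response:

```r
# Simulated illustration: R^2 of a one-predictor lm equals cor(x, y)^2
set.seed(1)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)
simple_model <- lm(y ~ x)
all.equal(summary(simple_model)$r.squared, cor(x, y)^2)  # TRUE
```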
# drawing diagnostic plots using the plot() function. Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage.
par(mfrow = c(2,2))
plot(my_model, which = c(1,2,5))
The assumptions of the model are that the error terms are approximately normally distributed with mean 0 and constant variance, and that they are uncorrelated and independent of the explanatory variables. In particular, the size of the error should not depend on the values of the explanatory variables.
From these plots the model assumptions seem approximately valid, with the small exception that small and large fitted values of Points show somewhat larger deviations from the estimated mean. A couple of observations also have somewhat large leverage, but overall the assumptions of the model seem to hold quite well. The ends of the range remain questionable, and additional analysis would be needed.
Mette Nissen, 15.9.2019. Exercise 3: analysis of student alcohol consumption.
# Exercise 3
alc <- read.csv("alc.csv", header = TRUE, sep = ",")
library(ggplot2)
library(GGally)
library(tidyverse)
## ── Attaching packages ──────────────────────────────────── tidyverse 1.3.0 ──
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 1.0.0 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ✔ purrr 0.3.3
## ── Conflicts ─────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(dplyr)
library(knitr)
I selected the variables gender, number of school absences, going out with friends, and final grade for the analysis. My hypothesis is that students performing poorly in school and missing classes are at higher risk of using more alcohol.
I also looked at the summary, which shows that the mean age is about 17 years. The bar chart shows that high use is more common among men. The next boxplot shows that, in both genders, students with high alcohol use go out more with friends. The final grade also seems to be lower. Plotting absences against age, the last figure shows that absences seem to be higher among older men with high alcohol consumption.
The same patterns appear in the mean values. The group has 198 female and 184 male students. Among females, 156 are not high users and 42 are; among males the numbers are 112 and 72, so a higher proportion of men are high users. High use does not affect the final grade among females, but there is a difference among male students: 12.2 in the non-high-use group versus 10.3 in the high-use group. Absences are higher in both genders with high use, and the same holds for going out with friends (see the mean_absences and mean_goout tables).
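The proportions quoted above can also be computed directly as row proportions of a cross table; a sketch, assuming the alc data frame from above:

```r
# Share of high users within each sex
# (margin = 1 normalizes each row to sum to 1)
prop.table(table(sex = alc$sex, high_use = alc$high_use), margin = 1)
```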
#Summary of the datatable
kable(summary(alc), digits = 2)
| X | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | reason | nursery | internet | guardian | traveltime | studytime | failures | schoolsup | famsup | paid | activities | higher | romantic | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 | alc_use | high_use | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 1.00 | GP:342 | F:198 | Min. :15.00 | R: 81 | GT3:278 | A: 38 | Min. :0.000 | Min. :0.000 | at_home : 53 | at_home : 16 | course :140 | no : 72 | no : 58 | father: 91 | Min. :1.000 | Min. :1.000 | Min. :0.0000 | no :331 | no :144 | no :205 | no :181 | no : 18 | no :261 | Min. :1.000 | Min. :1.00 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. :1.000 | Min. : 0.0 | Min. : 2.00 | Min. : 4.00 | Min. : 0.00 | Min. :1.000 | Mode :logical | |
| 1st Qu.: 96.25 | MS: 40 | M:184 | 1st Qu.:16.00 | U:301 | LE3:104 | T:344 | 1st Qu.:2.000 | 1st Qu.:2.000 | health : 33 | health : 17 | home :110 | yes:310 | yes:324 | mother:275 | 1st Qu.:1.000 | 1st Qu.:1.000 | 1st Qu.:0.0000 | yes: 51 | yes:238 | yes:177 | yes:201 | yes:364 | yes:121 | 1st Qu.:4.000 | 1st Qu.:3.00 | 1st Qu.:2.000 | 1st Qu.:1.000 | 1st Qu.:1.000 | 1st Qu.:3.000 | 1st Qu.: 1.0 | 1st Qu.:10.00 | 1st Qu.:10.00 | 1st Qu.:10.00 | 1st Qu.:1.000 | FALSE:268 | |
| Median :191.50 | NA | NA | Median :17.00 | NA | NA | NA | Median :3.000 | Median :3.000 | other :138 | other :211 | other : 34 | NA | NA | other : 16 | Median :1.000 | Median :2.000 | Median :0.0000 | NA | NA | NA | NA | NA | NA | Median :4.000 | Median :3.00 | Median :3.000 | Median :1.000 | Median :2.000 | Median :4.000 | Median : 3.0 | Median :12.00 | Median :12.00 | Median :12.00 | Median :1.500 | TRUE :114 | |
| Mean :191.50 | NA | NA | Mean :16.59 | NA | NA | NA | Mean :2.806 | Mean :2.565 | services: 96 | services:107 | reputation: 98 | NA | NA | NA | Mean :1.448 | Mean :2.037 | Mean :0.2016 | NA | NA | NA | NA | NA | NA | Mean :3.937 | Mean :3.22 | Mean :3.113 | Mean :1.482 | Mean :2.296 | Mean :3.573 | Mean : 4.5 | Mean :11.49 | Mean :11.47 | Mean :11.46 | Mean :1.889 | NA | |
| 3rd Qu.:286.75 | NA | NA | 3rd Qu.:17.00 | NA | NA | NA | 3rd Qu.:4.000 | 3rd Qu.:4.000 | teacher : 62 | teacher : 31 | NA | NA | NA | NA | 3rd Qu.:2.000 | 3rd Qu.:2.000 | 3rd Qu.:0.0000 | NA | NA | NA | NA | NA | NA | 3rd Qu.:5.000 | 3rd Qu.:4.00 | 3rd Qu.:4.000 | 3rd Qu.:2.000 | 3rd Qu.:3.000 | 3rd Qu.:5.000 | 3rd Qu.: 6.0 | 3rd Qu.:14.00 | 3rd Qu.:14.00 | 3rd Qu.:14.00 | 3rd Qu.:2.500 | NA | |
| Max. :382.00 | NA | NA | Max. :22.00 | NA | NA | NA | Max. :4.000 | Max. :4.000 | NA | NA | NA | NA | NA | NA | Max. :4.000 | Max. :4.000 | Max. :3.0000 | NA | NA | NA | NA | NA | NA | Max. :5.000 | Max. :5.00 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :5.000 | Max. :45.0 | Max. :18.00 | Max. :18.00 | Max. :18.00 | Max. :5.000 | NA |
# Graphs
p <- ggpairs(alc, columns = c("sex", "absences","goout", "G3", "high_use"), mapping = aes(col = sex, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p
# initialize a barchart of alcohol use difference between genders
g1 <- ggplot(data = alc, aes(x=sex)) + geom_bar() + facet_wrap("high_use") + ggtitle("Student alcohol consumption by sex")
g1
# initialize a plot of alcohol use and going out with friends
g2 <- ggplot() + geom_boxplot(data = alc, aes(x = sex, y = goout)) + facet_wrap("high_use") + ggtitle("Student going out with friends by alcohol consumption and sex")
g2
# Boxplot of alcohol use and school final grade
g3 <- ggplot() + geom_boxplot(data = alc, aes(x = sex, y = G3)) + facet_wrap("high_use") + ggtitle("Student final grades by alcohol consumption and sex")
g3
# Scatterplot showing a linear model between age and absences, separated by alcohol consumption
g4 <- ggplot(data = alc, aes(x = age, y = absences, color=sex, alpha = 0.3)) + geom_point() + facet_wrap("high_use")+ geom_jitter() + geom_smooth(method = "lm") + ggtitle("Student absences by alcohol consumption, sex and age")
g4
alc %>% group_by(sex) %>% summarise(count = n())
## # A tibble: 2 x 2
## sex count
## <fct> <int>
## 1 F 198
## 2 M 184
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_grade = mean(G3))
## # A tibble: 4 x 4
## # Groups: sex [2]
## sex high_use count mean_grade
## <fct> <lgl> <int> <dbl>
## 1 F FALSE 156 11.4
## 2 F TRUE 42 11.7
## 3 M FALSE 112 12.2
## 4 M TRUE 72 10.3
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_absences = mean(absences))
## # A tibble: 4 x 4
## # Groups: sex [2]
## sex high_use count mean_absences
## <fct> <lgl> <int> <dbl>
## 1 F FALSE 156 4.22
## 2 F TRUE 42 6.79
## 3 M FALSE 112 2.98
## 4 M TRUE 72 6.12
alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_goout = mean(goout))
## # A tibble: 4 x 4
## # Groups: sex [2]
## sex high_use count mean_goout
## <fct> <lgl> <int> <dbl>
## 1 F FALSE 156 2.96
## 2 F TRUE 42 3.36
## 3 M FALSE 112 2.71
## 4 M TRUE 72 3.93
# glm() model
m <- glm(high_use ~ absences + sex + goout, data = alc, family = "binomial")
# the summary of the model
summary(m)
##
## Call:
## glm(formula = high_use ~ absences + sex + goout, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7871 -0.8153 -0.5446 0.8357 2.4740
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.16317 0.47506 -8.764 < 2e-16 ***
## absences 0.08418 0.02237 3.764 0.000167 ***
## sexM 0.95872 0.25459 3.766 0.000166 ***
## goout 0.72981 0.11970 6.097 1.08e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 387.75 on 378 degrees of freedom
## AIC: 395.75
##
## Number of Fisher Scoring iterations: 4
# the coefficients of the model
coef(m)
## (Intercept) absences sexM goout
## -4.16317087 0.08418399 0.95871896 0.72980859
# the odds ratio (OR)
OR <- coef(m) %>% exp
# the confidence interval (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
OR
## (Intercept) absences sexM goout
## 0.01555815 1.08782902 2.60835292 2.07468346
CI
## 2.5 % 97.5 %
## (Intercept) 0.005885392 0.03804621
## absences 1.042458467 1.13933894
## sexM 1.593132148 4.33151387
## goout 1.650182481 2.64111050
# printing out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.01555815 0.005885392 0.03804621
## absences 1.08782902 1.042458467 1.13933894
## sexM 2.60835292 1.593132148 4.33151387
## goout 2.07468346 1.650182481 2.64111050
From the model summary we can see that all three explanatory variables are statistically highly significant. Computing the odds ratios and confidence intervals shows that absences, male sex, and going out are positively associated with high use, with confidence intervals staying above 1, indicating a significant association. The final grade G3, in contrast, showed a negative, non-significant association, so I leave it out of the model for the predictions.
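As a sanity check on the odds-ratio table: an odds ratio is just the exponentiated model coefficient, e.g. the goout estimate of about 0.7298 from the summary above:

```r
# exp() turns a log-odds coefficient into an odds ratio
exp(0.7298)  # about 2.07: one extra point of going out roughly doubles the odds
```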
# probability of high_use
probabilities <- predict(m, type = "response")
# adding the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# using the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = (probability > 0.5))
# the last ten original classes, predicted probabilities, and class predictions
select(alc, goout, absences, sex, high_use, probability, prediction) %>% tail(10)
## goout absences sex high_use probability prediction
## 373 2 0 M FALSE 0.14869987 FALSE
## 374 3 7 M TRUE 0.39514446 FALSE
## 375 3 1 F FALSE 0.13129452 FALSE
## 376 3 6 F FALSE 0.18714923 FALSE
## 377 2 2 F FALSE 0.07342805 FALSE
## 378 4 2 F FALSE 0.25434555 FALSE
## 379 2 2 F FALSE 0.07342805 FALSE
## 380 1 3 F FALSE 0.03989428 FALSE
## 381 5 4 M TRUE 0.68596604 TRUE
## 382 1 2 M TRUE 0.09060457 FALSE
# target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 253 15
## TRUE 65 49
Among the last ten cases, the true high_use values are TRUE for 3 and FALSE for 7; the predictions give TRUE for 1 and FALSE for 9.
# a plot of 'high_use' versus 'probability' in 'alc'
pr <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
pr2 <- pr + geom_point() + ggtitle("Predictions")
pr2
Next, the prediction error of the model is computed on the training data:
# the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)%>%prop.table %>% addmargins
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.66230366 0.03926702 0.70157068
## TRUE 0.17015707 0.12827225 0.29842932
## Sum 0.83246073 0.16753927 1.00000000
# a loss function
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2094241
The proportion of wrong predictions in this model, about 21%, is smaller than the 26% obtained in the DataCamp exercise.
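The figure above is still a training-set error, so it is optimistic. A more honest estimate would come from K-fold cross-validation, for example with boot::cv.glm; a sketch, assuming the alc data frame from above:

```r
library(boot)  # ships with base R

# same loss function as above: average number of wrong predictions
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)

# 10-fold cross-validation of the logistic model, scored with loss_func
m <- glm(high_use ~ absences + sex + goout, data = alc, family = "binomial")
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]  # average proportion of wrong predictions across the folds
```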
# access packages
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(ggplot2)
library(GGally)
library(tidyverse)
library(dplyr)
library(knitr)
library(corrplot)
## corrplot 0.84 loaded
library(tidyr)
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
# load the data
data("Boston")
This data frame contains the following columns:

- crim: per capita crime rate by town
- zn: proportion of residential land zoned for lots over 25,000 sq.ft.
- indus: proportion of non-retail business acres per town
- chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- nox: nitrogen oxides concentration (parts per 10 million)
- rm: average number of rooms per dwelling
- age: proportion of owner-occupied units built prior to 1940
- dis: weighted mean of distances to five Boston employment centres
- rad: index of accessibility to radial highways
- tax: full-value property-tax rate per $10,000
- ptratio: pupil-teacher ratio by town
- black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- lstat: lower status of the population (percent)
- medv: median value of owner-occupied homes in $1000s
In this week's exercise we use the Boston data set from the R MASS package, which contains historical data collected from 506 districts in the area around Boston.
Boston has 14 variables and 506 observations. The crime rate (crim) is the response variable.
The variables and their explanations are shown above.
#Dataset summary and variables
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
colnames(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
pairs(Boston)
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
#Graphical summary of crime variable
ggplot(Boston, aes(crim)) + stat_density() + theme_bw()
#Plotting each variable against crime rate
bosmelt <- melt(Boston, id="crim")
ggplot(bosmelt, aes(x=value, y=crim))+
facet_wrap(~variable, scales="free")+
geom_point()
boxplot(Boston$crim, Boston$zn, Boston$indus, Boston$chas, Boston$nox, Boston$rm, Boston$age, Boston$dis, Boston$rad, Boston$tax, Boston$ptratio, Boston$black, Boston$lstat, Boston$medv, names = c("crim", "zn", "indus", "chas", "nox", "rm", "age", "dis", "rad", "tax", "ptratio", "black", "lstat", "medv"))
mlm <- lm(formula = crim ~ ., data = Boston)
summary(mlm)
##
## Call:
## lm(formula = crim ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.924 -2.120 -0.353 1.019 75.051
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.033228 7.234903 2.354 0.018949 *
## zn 0.044855 0.018734 2.394 0.017025 *
## indus -0.063855 0.083407 -0.766 0.444294
## chas -0.749134 1.180147 -0.635 0.525867
## nox -10.313535 5.275536 -1.955 0.051152 .
## rm 0.430131 0.612830 0.702 0.483089
## age 0.001452 0.017925 0.081 0.935488
## dis -0.987176 0.281817 -3.503 0.000502 ***
## rad 0.588209 0.088049 6.680 6.46e-11 ***
## tax -0.003780 0.005156 -0.733 0.463793
## ptratio -0.271081 0.186450 -1.454 0.146611
## black -0.007538 0.003673 -2.052 0.040702 *
## lstat 0.126211 0.075725 1.667 0.096208 .
## medv -0.198887 0.060516 -3.287 0.001087 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.439 on 492 degrees of freedom
## Multiple R-squared: 0.454, Adjusted R-squared: 0.4396
## F-statistic: 31.47 on 13 and 492 DF, p-value: < 2.2e-16
The most significant variables in the model are dis and rad (highly significant) and medv (moderately significant); zn and black are weaker but still below p = 0.05.
cor_matrix<-cor(Boston)
cor_matrix %>% round(digits = 2)
## crim zn indus chas nox rm age dis rad tax ptratio
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31 -0.39
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72 0.38
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04 -0.12
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67 0.19
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29 -0.36
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51 0.26
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53 -0.23
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91 0.46
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 0.46
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44 -0.18
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51
## black lstat medv
## crim -0.39 0.46 -0.39
## zn 0.18 -0.41 0.36
## indus -0.36 0.60 -0.48
## chas 0.05 -0.05 0.18
## nox -0.38 0.59 -0.43
## rm 0.13 -0.61 0.70
## age -0.27 0.60 -0.38
## dis 0.29 -0.50 0.25
## rad -0.44 0.49 -0.38
## tax -0.44 0.54 -0.47
## ptratio -0.18 0.37 -0.51
## black 1.00 -0.37 0.33
## lstat -0.37 1.00 -0.74
## medv 0.33 -0.74 1.00
corrplot.mixed(cor_matrix, number.cex = .6)
The corrplot shows the relationships between the variables. The highest positive correlations are between rad and tax, indus and nox, and age and nox. The highest negative correlations are between age and dis, lstat and medv, and dis and nox. The summaries show that the variables are on very different scales, so we need to standardize the data set before the linear discriminant analysis later.
# center and standardize variables
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
Standardizing centers and scales the data: for each continuous variable, the mean of the variable is subtracted and the result is divided by the variable's standard deviation. After this, each variable has mean 0 and SD 1.
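A quick check of what scale() does, on a small vector rather than the full data:

```r
# scale() = subtract the mean, divide by the standard deviation
x <- c(1, 4, 9, 16, 25)
manual <- (x - mean(x)) / sd(x)
all.equal(manual, as.vector(scale(x)))  # TRUE
c(mean(manual), sd(manual))             # mean ~ 0, sd = 1
```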
# creating a quantile vector of crim
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, labels = c("low", "med_low", "med_high", "high"), include.lowest = TRUE)
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
#removing crim
boston_scaled <- dplyr::select(boston_scaled, -crim)
#adding categorical variable to the table
boston_scaled <- data.frame(boston_scaled, crime)
For prediction we split the data into a training set, here 80% of the cases, and use the remaining 20% as a test set, which shows the accuracy of the model.
n <- nrow(boston_scaled)
#Choosing 80% to the training set
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
# creating the test set
test <- boston_scaled[-ind,]
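One caveat: sample() draws different rows on every run, so the split (and all results below) change between knits. Fixing the seed first makes the split reproducible; a sketch, with an arbitrary seed value:

```r
set.seed(2019)  # arbitrary seed, fixed so the split is reproducible
ind <- sample(nrow(boston_scaled), size = nrow(boston_scaled) * 0.8)
train <- boston_scaled[ind, ]
test <- boston_scaled[-ind, ]
```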
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2351485 0.2648515 0.2599010 0.2400990
##
## Group means:
## zn indus chas nox rm age
## low 0.99382637 -0.9425846 -0.06511327 -0.9138794 0.45964590 -0.8838782
## med_low -0.09012577 -0.3210174 -0.05155709 -0.5736368 -0.10200686 -0.3442029
## med_high -0.39658567 0.2242815 0.14012905 0.4296007 0.07222506 0.5034917
## high -0.48724019 1.0149946 -0.02879709 1.0600922 -0.35441183 0.8337612
## dis rad tax ptratio black lstat
## low 0.9579618 -0.7038213 -0.7287184 -0.4880429 0.3751934 -0.77868566
## med_low 0.3602152 -0.5439511 -0.4654886 -0.1055134 0.3189766 -0.15801437
## med_high -0.4122934 -0.4185755 -0.2935570 -0.2574838 0.1651912 0.06980203
## high -0.8543703 1.6596029 1.5294129 0.8057784 -0.8461899 0.95182486
## medv
## low 0.5095235
## med_low 0.0203129
## med_high 0.1405777
## high -0.7993128
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.122170990 0.586964270 -1.14083441
## indus 0.005181891 -0.341794759 0.22040463
## chas -0.075824154 0.051400685 0.06966214
## nox 0.237745167 -0.790955322 -1.34198690
## rm -0.084634265 -0.014945171 -0.11799973
## age 0.332732763 -0.397683317 -0.27579868
## dis -0.114669377 -0.179384614 0.02969404
## rad 3.385281510 0.712762323 -0.05839673
## tax 0.011554485 0.326727853 0.58349973
## ptratio 0.128303202 0.003442482 -0.41136016
## black -0.147418275 -0.015178506 0.12492664
## lstat 0.174540976 -0.187036271 0.31762396
## medv 0.135728495 -0.444657212 -0.26665753
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9450 0.0423 0.0127
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
l <- plot(lda.fit, dimen = 2, col = classes, pch = classes)
l + lda.arrows(lda.fit, myscale = 1)
## integer(0)
# saving the correct classes from test data
correct_classes <- test$crime
# removing the crime variable from test data
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 15 12 5 0
## med_low 2 14 3 0
## med_high 2 8 10 1
## high 0 0 1 29
From the cross table we can see that the high class is predicted very well, but more errors occur in the lower classes.
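The overall accuracy can be read off the cross table as the share of observations on its diagonal; a sketch, assuming correct_classes and lda.pred from the chunk above:

```r
# Proportion of test cases classified correctly
cross <- table(correct = correct_classes, predicted = lda.pred$class)
sum(diag(cross)) / sum(cross)
```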
# Boston dataset reading and standardization again
data("Boston")
b_boston_scaled <- scale(Boston)
# Distances with euclidean distance
dist_eu <- dist(b_boston_scaled)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
km <- kmeans(b_boston_scaled, centers = 3)
set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(b_boston_scaled, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')
The optimal number of clusters is the point where the total within-cluster sum of squares drops sharply. Here it seems to be two.
# Clustering again
km2 <- kmeans(b_boston_scaled, centers = 2)
pairs(b_boston_scaled[,1:7], col = km2$cluster)
pairs(b_boston_scaled[,8:14], col = km2$cluster)
km3 <- kmeans(Boston, centers = 3)
set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')
km4 <- kmeans(Boston, centers = 2)
pairs(Boston[,1:7], col = km4$cluster)
pairs(Boston[,8:14], col = km4$cluster)
bins <- quantile(Boston$crim)
crime2 <- cut(Boston$crim, breaks = bins, labels = c("low", "med_low", "med_high", "high"), include.lowest = TRUE)
table(crime2)
## crime2
## low med_low med_high high
## 127 126 126 127
Boston <- dplyr::select(Boston, -crim)
Boston <- data.frame(Boston, crime2)
u <- nrow(Boston)
ind2 <- sample(u, size = u * 0.8)
train2 <- Boston[ind2,]
test2 <- Boston[-ind2,]
lda.fit2 <- lda(crime2 ~ ., data = train2)
lda.fit2
## Call:
## lda(crime2 ~ ., data = train2)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2524752 0.2500000 0.2351485 0.2623762
##
## Group means:
## zn indus chas nox rm age dis
## low 33.348039 5.173333 0.03921569 0.4534382 6.584784 43.77549 5.657954
## med_low 8.094059 9.092871 0.04950495 0.4921772 6.197436 59.68119 4.467285
## med_high 2.547368 12.335158 0.15789474 0.6040526 6.402337 81.93684 2.911674
## high 0.000000 18.100000 0.05660377 0.6798208 6.005179 91.18208 2.027506
## rad tax ptratio black lstat medv
## low 3.490196 282.8333 17.54412 390.9058 7.175784 27.00098
## med_low 4.742574 321.8812 18.31683 385.1081 11.669109 22.51485
## med_high 5.736842 349.4211 17.51789 364.4000 12.678211 24.52947
## high 24.000000 666.0000 20.20000 278.7887 18.729623 15.95849
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.0059594595 0.0291392963 -0.043654152
## indus 0.0116837992 -0.0194328871 0.028512310
## chas -0.4398527204 -0.4347835204 -0.196120764
## nox 1.9916327959 -5.6218782002 -10.346653879
## rm -0.1892416869 -0.1153374711 -0.261021540
## age 0.0080682351 -0.0142044048 -0.007142634
## dis -0.0553948278 -0.0682297733 0.054422317
## rad 0.4673782227 0.1037951245 -0.004063668
## tax 0.0004496804 -0.0003602709 0.002638328
## ptratio 0.0678345162 0.0524262621 -0.082236100
## black -0.0015035340 0.0003945387 0.001896470
## lstat 0.0294406326 -0.0448026121 0.058101833
## medv 0.0234646334 -0.0462138320 -0.009722294
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9645 0.0275 0.0081
lda.arrows2 <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
classes2 <- as.numeric(train2$crime2)
# plot the lda results
l <- plot(lda.fit2, dimen = 2, col = classes2, pch = classes2)
l + lda.arrows2(lda.fit2, myscale = 2)
## Warning in arrows(x0 = 0, y0 = 0, x1 = myscale * heads[, choices[1]], y1 =
## myscale * : zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(x0 = 0, y0 = 0, x1 = myscale * heads[, choices[1]], y1 =
## myscale * : zero-length arrow is of indeterminate angle and so skipped
## integer(0)
Nox seems to be the most influential linear separator in the analysis without standardization. I could not get the colors right, but otherwise the plot looks fine.
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
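To check that this matrix product is the same projection that predict() computes, here is a sketch on the iris data (used because it ships with R): centering the predictors at their overall means and multiplying by the scaling matrix reproduces the LD scores.

```r
library(MASS)  # for lda()

fit <- lda(Species ~ ., data = iris)
X <- as.matrix(iris[, 1:4])
# center at the prior-weighted group means (on the training data this
# equals the overall column means, since priors default to group proportions)
scores <- scale(X, center = colMeans(X), scale = FALSE) %*% fit$scaling
pred <- predict(fit)
all.equal(unname(scores), unname(pred$x))  # TRUE
```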
This week we are using the human dataset from the United Nations Development Programme. From this dataset we have selected 8 variables (country names are stored as row names):
Edu2.FM - ratio of females to males with at least secondary education
Labo.FM - ratio of female to male labour force participation rates
Edu.Exp - expected years of education
Life.Exp - life expectancy at birth
GNI - gross national income per capita
Mat.Mor - maternal mortality ratio
Ado.Birth - adolescent birth rate
Parli.F - percentage of female representatives in parliament
human <- read.csv("data/human.csv", row.names = 1)
library(GGally)
library(tidyverse)
library(corrplot)
library(dplyr)
library(knitr)
library(tidyr)
library(reshape2)
library(plotly)
library(FactoMineR)
library(ggplot2)
str(human)
## 'data.frame': 148 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
summary(human)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1980 Min. :0.1857 Min. : 7.00 Min. :49.00
## 1st Qu.:0.8097 1st Qu.:0.5971 1st Qu.:11.57 1st Qu.:68.38
## Median :0.9444 Median :0.7504 Median :13.55 Median :74.35
## Mean :0.8766 Mean :0.7006 Mean :13.42 Mean :72.44
## 3rd Qu.:0.9990 3rd Qu.:0.8385 3rd Qu.:15.32 3rd Qu.:77.45
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 680 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 5190 1st Qu.: 11.0 1st Qu.: 12.18 1st Qu.:12.15
## Median : 12408 Median : 45.5 Median : 31.30 Median :19.50
## Mean : 18402 Mean :120.9 Mean : 43.72 Mean :20.95
## 3rd Qu.: 25350 3rd Qu.:170.0 3rd Qu.: 69.05 3rd Qu.:27.82
## Max. :123124 Max. :730.0 Max. :175.60 Max. :57.50
pairs <- ggpairs(human)
pairs
# compute the correlation matrix and visualize it with corrplot
cor(human) %>% corrplot()
From the summary we can see that the distributions differ greatly between variables: GNI ranges from 680 to 123124, while Labo.FM only ranges from 0.1857 to 1.0380, so the variables are on very different scales. The correlations also vary, from strongly positive (maternal mortality vs. adolescent birth rate) to strongly negative (maternal mortality vs. life expectancy).
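A quick way to list the strongest pairwise correlations (sketched here on the bundled mtcars data, since human.csv is a local file) is to melt the correlation matrix into long form and sort by absolute value:

```r
m <- cor(mtcars)
m[!lower.tri(m)] <- NA                        # keep each pair only once
pairs_df <- na.omit(as.data.frame(as.table(m)))
names(pairs_df) <- c("var1", "var2", "r")
head(pairs_df[order(-abs(pairs_df$r)), ], 3)  # three strongest correlations
```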
#PCA analysis
pca_human <- prcomp(human)
pca_human
## Standard deviations (1, .., p=8):
## [1] 1.862483e+04 1.420218e+02 2.116337e+01 1.147634e+01 3.608478e+00
## [6] 1.571358e+00 1.923557e-01 1.512244e-01
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4 PC5
## Edu2.FM -4.649714e-06 0.0006867686 -0.0013662633 2.317587e-04 -0.0055155087
## Labo.FM -9.719607e-08 -0.0003158778 0.0003015743 4.425582e-03 0.0052146757
## Edu.Exp -8.706739e-05 0.0087580112 0.0053006118 3.255666e-02 0.1415081241
## Life.Exp -2.533543e-04 0.0333850374 -0.0048336882 7.494569e-02 0.9861585472
## GNI -9.999893e-01 -0.0046050756 -0.0003728181 -6.050839e-05 -0.0001023982
## Mat.Mor 4.477844e-03 -0.9863064242 0.1609608588 1.124047e-02 0.0337070459
## Ado.Birth 1.111169e-03 -0.1611663247 -0.9864335638 -3.008031e-02 0.0039566221
## Parli.F -5.580863e-05 0.0034657989 -0.0314144646 9.961287e-01 -0.0791032655
## PC6 PC7 PC8
## Edu2.FM 1.762118e-02 6.384062e-01 7.694765e-01
## Labo.FM 3.217052e-02 7.688482e-01 -6.385848e-01
## Edu.Exp 9.886674e-01 -3.580796e-02 8.073940e-03
## Life.Exp -1.437809e-01 4.420684e-03 6.632638e-03
## GNI -2.864828e-05 -1.086468e-06 -1.752803e-06
## Mat.Mor 2.675068e-03 1.432260e-04 1.224251e-03
## Ado.Birth 7.122449e-03 -7.522696e-04 -1.109192e-03
## Parli.F -2.145731e-02 -2.750969e-03 1.847856e-03
#Summary of principal component analysis
s <- summary(pca_human)
# rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2,], digits = 1)
# creating an object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
# drawing a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink"), xlab = pc_lab[1], ylab = pc_lab[2], main = "PCA_non-scaled")
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
On unscaled data, PCA is not useful because GNI has by far the largest standard deviation, so the first principal component captures essentially 100% of the variance. Therefore we must scale the data for a valid analysis.
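A minimal demonstration of this scale sensitivity (with made-up data): when one variable has a much larger variance, it dominates PC1 entirely until the data are standardized.

```r
set.seed(1)
d <- data.frame(a = rnorm(100), b = rnorm(100), big = rnorm(100, sd = 1000))
p_raw <- prcomp(d)                 # unscaled: 'big' swamps everything
p_std <- prcomp(d, scale. = TRUE)  # standardized to unit variance
summary(p_raw)$importance[2, 1]    # PC1 share of variance: essentially 1
summary(p_std)$importance[2, 1]    # PC1 share: roughly 1/3
```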
# standardize the variables
human_std <- scale(human)
summary(human_std)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :-3.1089 Min. :-2.6159 Min. :-2.42684 Min. :-3.0723
## 1st Qu.:-0.3065 1st Qu.:-0.5256 1st Qu.:-0.69805 1st Qu.:-0.5329
## Median : 0.3108 Median : 0.2531 Median : 0.04826 Median : 0.2503
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.5610 3rd Qu.: 0.7005 3rd Qu.: 0.71899 3rd Qu.: 0.6566
## Max. : 2.8414 Max. : 1.7144 Max. : 2.56114 Max. : 1.4495
## GNI Mat.Mor Ado.Birth Parli.F
## Min. :-0.9515 Min. :-0.7355 Min. :-1.1573 Min. :-1.8197
## 1st Qu.:-0.7094 1st Qu.:-0.6742 1st Qu.:-0.8466 1st Qu.:-0.7643
## Median :-0.3218 Median :-0.4626 Median :-0.3333 Median :-0.1258
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3730 3rd Qu.: 0.3009 3rd Qu.: 0.6799 3rd Qu.: 0.5973
## Max. : 5.6228 Max. : 3.7352 Max. : 3.5397 Max. : 3.1749
#PCA analysis with scaled variables
pca_human_std <- prcomp(human_std)
pca_human_std
## Standard deviations (1, .., p=8):
## [1] 2.0333042 1.1429035 0.8966375 0.7974445 0.7016064 0.5533825 0.4782085
## [8] 0.3039769
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4 PC5
## Edu2.FM -0.32465414 0.09227181 -0.43377536 0.71962839 -0.3646631
## Labo.FM 0.02176326 0.72972518 -0.51145278 -0.19826256 0.3349423
## Edu.Exp -0.42571462 0.14257522 -0.03704407 -0.14368290 0.1144547
## Life.Exp -0.44696256 -0.02108739 0.13444430 -0.07252202 0.1544074
## GNI -0.36191020 0.03284391 -0.10502738 -0.55841503 -0.6927543
## Mat.Mor 0.44847613 0.16196165 -0.07472686 -0.20980104 -0.2594996
## Ado.Birth 0.41583795 0.14603035 -0.06162895 0.13254030 -0.3781145
## Parli.F -0.08992529 0.62416308 0.71441898 0.20859522 -0.1663542
## PC6 PC7 PC8
## Edu2.FM -0.102011694 0.06004801 0.18184654
## Labo.FM -0.113322727 -0.17601955 -0.10061935
## Edu.Exp 0.611088764 0.62425612 0.01404528
## Life.Exp 0.296297409 -0.63847349 0.50711229
## GNI -0.176468208 -0.08563262 -0.16340648
## Mat.Mor 0.004512949 0.24190007 0.77276191
## Ado.Birth 0.683004537 -0.31602881 -0.27395047
## Parli.F -0.133691243 0.04841963 -0.02315017
# rounded percentages of variance captured by each PC
s_st <- summary(pca_human_std)
pca_pr_st <- round(100*s_st$importance[2,], digits = 1)
# creating an object pc_lab to be used as axis labels
pc_lab_st <- paste0(names(pca_pr_st), " (", pca_pr_st, "%)")
# drawing a biplot of the principal component representation and the original variables
biplot(pca_human_std, choices = 1:2, cex = c(0.6, 0.8), col = c("grey40", "blue"), xlab = pc_lab_st[1], ylab = pc_lab_st[2], main = "PCA_scaled")
With scaling, the analysis gives a much more reliable result. Rwanda seems to be an outlier in this two-dimensional biplot. PC1 and PC2 together account for about 68% of the variance, which is quite a lot.
The arrows show similar effects within the GII (gender inequality) categories (labour force participation and parliament representation) and within the welfare categories (education, income, life expectancy and maternal health). Judging by the angles between the arrows, the variables within each group correlate quite strongly with each other.
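The angle interpretation can be sanity-checked with simulated data: in a biplot of standardized data, the cosine of the angle between two variable arrows approximates their correlation, at least when the first two PCs capture most of the variance. All data below is made up for illustration.

```r
set.seed(1)
x <- rnorm(50)
d <- data.frame(x = x, y = x + rnorm(50, sd = 0.3), z = rnorm(50))
p <- prcomp(d, scale. = TRUE)
r <- p$rotation[, 1:2]  # arrow directions in the PC1-PC2 plane
cos_angle <- sum(r["x", ] * r["y", ]) /
  (sqrt(sum(r["x", ]^2)) * sqrt(sum(r["y", ]^2)))
c(cos_angle = cos_angle, correlation = cor(d$x, d$y))  # both close to 1
```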
data(tea)
# What is tea all about
dim(tea)
## [1] 300 36
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
summary(tea)
## breakfast tea.time evening lunch
## breakfast :144 Not.tea time:131 evening :103 lunch : 44
## Not.breakfast:156 tea time :169 Not.evening:197 Not.lunch:256
##
##
##
##
##
## dinner always home work
## dinner : 21 always :103 home :291 Not.work:213
## Not.dinner:279 Not.always:197 Not.home: 9 work : 87
##
##
##
##
##
## tearoom friends resto pub
## Not.tearoom:242 friends :196 Not.resto:221 Not.pub:237
## tearoom : 58 Not.friends:104 resto : 79 pub : 63
##
##
##
##
##
## Tea How sugar how
## black : 74 alone:195 No.sugar:155 tea bag :170
## Earl Grey:193 lemon: 33 sugar :145 tea bag+unpackaged: 94
## green : 33 milk : 63 unpackaged : 36
## other: 9
##
##
##
## where price age sex
## chain store :192 p_branded : 95 Min. :15.00 F:178
## chain store+tea shop: 78 p_cheap : 7 1st Qu.:23.00 M:122
## tea shop : 30 p_private label: 21 Median :32.00
## p_unknown : 12 Mean :37.05
## p_upscale : 53 3rd Qu.:48.00
## p_variable :112 Max. :90.00
##
## SPC Sport age_Q frequency
## employee :59 Not.sportsman:121 15-24:92 1/day : 95
## middle :40 sportsman :179 25-34:69 1 to 2/week: 44
## non-worker :64 35-44:40 +2/day :127
## other worker:20 45-59:61 3 to 6/week: 34
## senior :35 +60 :38
## student :70
## workman :12
## escape.exoticism spirituality healthy
## escape-exoticism :142 Not.spirituality:206 healthy :210
## Not.escape-exoticism:158 spirituality : 94 Not.healthy: 90
##
##
##
##
##
## diuretic friendliness iron.absorption
## diuretic :174 friendliness :242 iron absorption : 31
## Not.diuretic:126 Not.friendliness: 58 Not.iron absorption:269
##
##
##
##
##
## feminine sophisticated slimming exciting
## feminine :129 Not.sophisticated: 85 No.slimming:255 exciting :116
## Not.feminine:171 sophisticated :215 slimming : 45 No.exciting:184
##
##
##
##
##
## relaxing effect.on.health
## No.relaxing:113 effect on health : 66
## relaxing :187 No.effect on health:234
##
##
##
##
##
The tea dataset includes 300 observations of 36 variables describing tea-drinking habits. Most of the variables are factors, many with only two levels. The full dataset is too large for a reasonable analysis, so I have picked the variables related to health attributes.
# column names to keep in the dataset (note: "sport" is a typo for the
# actual column name "Sport", so it is dropped with a warning below)
keep_columns <- c("Tea", "sugar", "pub", "friends", "sport", "healthy", "effect.on.health", "slimming", "relaxing")
# selecting the 'keep_columns' to create a new dataset
tea_health <- select(tea, one_of(keep_columns))
## Warning: Unknown columns: `sport`
# look at the summaries and structure of the tea_health data
summary(tea_health)
## Tea sugar pub friends
## black : 74 No.sugar:155 Not.pub:237 friends :196
## Earl Grey:193 sugar :145 pub : 63 Not.friends:104
## green : 33
## healthy effect.on.health slimming
## healthy :210 effect on health : 66 No.slimming:255
## Not.healthy: 90 No.effect on health:234 slimming : 45
##
## relaxing
## No.relaxing:113
## relaxing :187
##
str(tea_health)
## 'data.frame': 300 obs. of 8 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
gather(tea_health) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
pairs <- ggpairs(tea_health)
pairs
Most of the tea drinkers drink Earl Grey, with friends. 210 respondents think tea is healthy, but only 66 think tea has an effect on health. A majority (187) find tea relaxing. Sugar is used more often with Earl Grey.
# multiple correspondence analysis
mca <- MCA(tea_health, graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea_health, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.177 0.175 0.153 0.129 0.123 0.112 0.099
## % of var. 15.749 15.527 13.567 11.437 10.899 9.959 8.804
## Cumulative % of var. 15.749 31.275 44.842 56.279 67.178 77.137 85.940
## Dim.8 Dim.9
## Variance 0.087 0.071
## % of var. 7.731 6.329
## Cumulative % of var. 93.671 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
## 1 | -0.309 0.179 0.087 | 0.395 0.298 0.142 | -0.111 0.027
## 2 | -0.530 0.528 0.259 | 0.622 0.739 0.357 | 0.121 0.032
## 3 | -0.202 0.076 0.086 | -0.251 0.120 0.133 | -0.036 0.003
## 4 | -0.129 0.031 0.025 | -0.184 0.065 0.052 | -0.346 0.261
## 5 | 0.133 0.033 0.020 | 0.269 0.138 0.082 | -0.136 0.040
## 6 | -0.350 0.230 0.191 | 0.043 0.003 0.003 | -0.114 0.028
## 7 | -0.202 0.076 0.086 | -0.251 0.120 0.133 | -0.036 0.003
## 8 | -0.610 0.700 0.390 | 0.323 0.200 0.110 | 0.221 0.106
## 9 | 0.133 0.033 0.020 | 0.269 0.138 0.082 | -0.136 0.040
## 10 | -0.610 0.700 0.390 | 0.323 0.200 0.110 | 0.221 0.106
## cos2
## 1 0.011 |
## 2 0.014 |
## 3 0.003 |
## 4 0.181 |
## 5 0.021 |
## 6 0.020 |
## 7 0.003 |
## 8 0.051 |
## 9 0.021 |
## 10 0.051 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2 v.test
## black | -0.512 4.559 0.086 -5.064 | 0.560 5.528 0.103 5.537
## Earl Grey | 0.363 5.991 0.238 8.438 | -0.379 6.599 0.259 -8.792
## green | -0.977 7.410 0.118 -5.940 | 0.959 7.243 0.114 5.831
## No.sugar | -0.360 4.719 0.138 -6.432 | 0.367 4.981 0.144 6.562
## sugar | 0.385 5.044 0.138 6.432 | -0.392 5.325 0.144 -6.562
## Not.pub | -0.041 0.092 0.006 -1.366 | 0.267 4.033 0.268 8.957
## pub | 0.153 0.348 0.006 1.366 | -1.005 15.170 0.268 -8.957
## friends | 0.173 1.383 0.057 4.112 | -0.340 5.416 0.218 -8.079
## Not.friends | -0.326 2.607 0.057 -4.112 | 0.641 10.207 0.218 8.079
## healthy | -0.488 11.773 0.556 -12.896 | -0.227 2.584 0.120 -5.999
## Dim.3 ctr cos2 v.test
## black | 0.936 17.710 0.287 9.264 |
## Earl Grey | -0.108 0.613 0.021 -2.506 |
## green | -1.469 19.430 0.267 -8.928 |
## No.sugar | 0.351 5.200 0.131 6.267 |
## sugar | -0.375 5.559 0.131 -6.267 |
## Not.pub | -0.092 0.546 0.032 -3.080 |
## pub | 0.345 2.053 0.032 3.080 |
## friends | 0.085 0.382 0.013 2.006 |
## Not.friends | -0.159 0.720 0.013 -2.006 |
## healthy | 0.021 0.025 0.001 0.549 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.255 0.271 0.461 |
## sugar | 0.138 0.144 0.131 |
## pub | 0.006 0.268 0.032 |
## friends | 0.057 0.218 0.013 |
## healthy | 0.556 0.120 0.001 |
## effect.on.health | 0.345 0.131 0.181 |
## slimming | 0.043 0.010 0.379 |
## relaxing | 0.017 0.235 0.023 |
# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali")
With this analysis we can see that 31.3% of the model's variance is explained by the first two dimensions. The social aspect seems to be captured mostly by Dim 2 and physical health by Dim 1. Judging by the minor clusters, people who drink tea with friends tend to use sugar and find tea relaxing, whereas people who drink tea without friends tend to drink black tea without sugar.
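As a quick check of the 31.3% figure, the cumulative share can be recomputed from the eigenvalue percentages printed in the summary above (copied by hand here):

```r
# percentages of variance per MCA dimension, from the summary output above
var_pct <- c(15.749, 15.527, 13.567, 11.437, 10.899, 9.959, 8.804, 7.731, 6.329)
cum_pct <- cumsum(var_pct)
cum_pct[2]  # Dim.1 + Dim.2: about 31.3%
```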
library(dplyr)
library(tidyr)
###Load the data sets (BPRS and RATS)
BPRS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt", header = TRUE, sep = " ")
RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", header = TRUE, sep = '\t')
###Inspecting the WIDE data
names(BPRS)
## [1] "treatment" "subject" "week0" "week1" "week2" "week3"
## [7] "week4" "week5" "week6" "week7" "week8"
names(RATS)
## [1] "ID" "Group" "WD1" "WD8" "WD15" "WD22" "WD29" "WD36" "WD43"
## [10] "WD44" "WD50" "WD57" "WD64"
summary(BPRS)
## treatment subject week0 week1 week2
## Min. :1.0 Min. : 1.00 Min. :24.00 Min. :23.00 Min. :26.0
## 1st Qu.:1.0 1st Qu.: 5.75 1st Qu.:38.00 1st Qu.:35.00 1st Qu.:32.0
## Median :1.5 Median :10.50 Median :46.00 Median :41.00 Median :38.0
## Mean :1.5 Mean :10.50 Mean :48.00 Mean :46.33 Mean :41.7
## 3rd Qu.:2.0 3rd Qu.:15.25 3rd Qu.:58.25 3rd Qu.:54.25 3rd Qu.:49.0
## Max. :2.0 Max. :20.00 Max. :78.00 Max. :95.00 Max. :75.0
## week3 week4 week5 week6 week7
## Min. :24.00 Min. :20.00 Min. :20.00 Min. :19.00 Min. :18.0
## 1st Qu.:29.75 1st Qu.:28.00 1st Qu.:26.00 1st Qu.:22.75 1st Qu.:23.0
## Median :36.50 Median :34.50 Median :30.50 Median :28.50 Median :30.0
## Mean :39.15 Mean :36.35 Mean :32.55 Mean :31.23 Mean :32.2
## 3rd Qu.:44.50 3rd Qu.:43.00 3rd Qu.:38.00 3rd Qu.:37.00 3rd Qu.:38.0
## Max. :76.00 Max. :66.00 Max. :64.00 Max. :64.00 Max. :62.0
## week8
## Min. :20.00
## 1st Qu.:22.75
## Median :28.00
## Mean :31.43
## 3rd Qu.:35.25
## Max. :75.00
summary(RATS)
## ID Group WD1 WD8 WD15
## Min. : 1.00 Min. :1.00 Min. :225.0 Min. :230.0 Min. :230.0
## 1st Qu.: 4.75 1st Qu.:1.00 1st Qu.:252.5 1st Qu.:255.0 1st Qu.:255.0
## Median : 8.50 Median :1.50 Median :340.0 Median :345.0 Median :347.5
## Mean : 8.50 Mean :1.75 Mean :365.9 Mean :369.1 Mean :372.5
## 3rd Qu.:12.25 3rd Qu.:2.25 3rd Qu.:480.0 3rd Qu.:476.2 3rd Qu.:486.2
## Max. :16.00 Max. :3.00 Max. :555.0 Max. :560.0 Max. :565.0
## WD22 WD29 WD36 WD43
## Min. :232.0 Min. :240.0 Min. :240.0 Min. :243.0
## 1st Qu.:267.2 1st Qu.:268.8 1st Qu.:267.2 1st Qu.:269.2
## Median :351.5 Median :356.5 Median :360.0 Median :360.0
## Mean :379.2 Mean :383.9 Mean :387.0 Mean :386.0
## 3rd Qu.:492.5 3rd Qu.:497.8 3rd Qu.:504.2 3rd Qu.:501.0
## Max. :580.0 Max. :590.0 Max. :597.0 Max. :595.0
## WD44 WD50 WD57 WD64
## Min. :244.0 Min. :238.0 Min. :247.0 Min. :245.0
## 1st Qu.:270.0 1st Qu.:273.8 1st Qu.:273.8 1st Qu.:278.0
## Median :362.0 Median :370.0 Median :373.5 Median :378.0
## Mean :388.3 Mean :394.6 Mean :398.6 Mean :404.1
## 3rd Qu.:510.5 3rd Qu.:516.0 3rd Qu.:524.5 3rd Qu.:530.8
## Max. :595.0 Max. :612.0 Max. :618.0 Max. :628.0
str(BPRS)
## 'data.frame': 40 obs. of 11 variables:
## $ treatment: int 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : int 1 2 3 4 5 6 7 8 9 10 ...
## $ week0 : int 42 58 54 55 72 48 71 30 41 57 ...
## $ week1 : int 36 68 55 77 75 43 61 36 43 51 ...
## $ week2 : int 36 61 41 49 72 41 47 38 39 51 ...
## $ week3 : int 43 55 38 54 65 38 30 38 35 55 ...
## $ week4 : int 41 43 43 56 50 36 27 31 28 53 ...
## $ week5 : int 40 34 28 50 39 29 40 26 22 43 ...
## $ week6 : int 38 28 29 47 32 33 30 26 20 43 ...
## $ week7 : int 47 28 25 42 38 27 31 25 23 39 ...
## $ week8 : int 51 28 24 46 32 25 31 24 21 32 ...
str(RATS)
## 'data.frame': 16 obs. of 13 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Group: int 1 1 1 1 1 1 1 1 2 2 ...
## $ WD1 : int 240 225 245 260 255 260 275 245 410 405 ...
## $ WD8 : int 250 230 250 255 260 265 275 255 415 420 ...
## $ WD15 : int 255 230 250 255 255 270 260 260 425 430 ...
## $ WD22 : int 260 232 255 265 270 275 270 268 428 440 ...
## $ WD29 : int 262 240 262 265 270 275 273 270 438 448 ...
## $ WD36 : int 258 240 265 268 273 277 274 265 443 460 ...
## $ WD43 : int 266 243 267 270 274 278 276 265 442 458 ...
## $ WD44 : int 266 244 267 272 273 278 271 267 446 464 ...
## $ WD50 : int 265 238 264 274 276 284 282 273 456 475 ...
## $ WD57 : int 272 247 268 273 278 279 281 274 468 484 ...
## $ WD64 : int 278 245 269 275 280 281 284 278 478 496 ...
###The treatment weeks seem to be in different columns
###Convert categorical data to factors - BPRS
BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)
BPRSL <- BPRS %>% tidyr::gather(key = weeks, value = bprs, -treatment, -subject)
BPRSL <- BPRSL %>% dplyr::mutate(week = as.integer(substr(weeks, 5,5)))
###And the same preparation to RATS
###Convert categorical data to factors (RATS), convert the data to long form and extract the time
RATS$ID <- factor(RATS$ID)
RATS$Group <- factor(RATS$Group)
RATSL <- RATS %>%
tidyr::gather(key = WD, value = Weight, -ID, -Group) %>%
dplyr::mutate(Time = as.integer(substr(WD, 3, 4)))
glimpse(RATSL)
## Observations: 176
## Variables: 5
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3…
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1,…
## $ WD <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1",…
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 47…
## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8,…
str(RATSL)
## 'data.frame': 176 obs. of 5 variables:
## $ ID : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
## $ WD : chr "WD1" "WD1" "WD1" "WD1" ...
## $ Weight: int 240 225 245 260 255 260 275 245 410 405 ...
## $ Time : int 1 1 1 1 1 1 1 1 1 1 ...
names(RATSL)
## [1] "ID" "Group" "WD" "Weight" "Time"
In the long form, time is in a single column, whereas in the wide form each time point has its own column.
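The same wide-to-long conversion can also be sketched with base R's reshape() on a tiny made-up frame, without tidyr:

```r
wide <- data.frame(subject = 1:2, week0 = c(42, 58), week1 = c(36, 68))
long <- reshape(wide, direction = "long",
                varying = c("week0", "week1"), v.names = "bprs",
                timevar = "week", times = 0:1, idvar = "subject")
long  # 4 rows: one per subject-week combination
```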
#Access the package ggplot2
library(ggplot2)
##Exploring RATS and RATSL dataset
glimpse(RATS)
## Observations: 16
## Variables: 13
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ WD1 <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 470…
## $ WD8 <int> 250, 230, 250, 255, 260, 265, 275, 255, 415, 420, 445, 560, 465…
## $ WD15 <int> 255, 230, 250, 255, 255, 270, 260, 260, 425, 430, 450, 565, 475…
## $ WD22 <int> 260, 232, 255, 265, 270, 275, 270, 268, 428, 440, 452, 580, 485…
## $ WD29 <int> 262, 240, 262, 265, 270, 275, 273, 270, 438, 448, 455, 590, 487…
## $ WD36 <int> 258, 240, 265, 268, 273, 277, 274, 265, 443, 460, 455, 597, 493…
## $ WD43 <int> 266, 243, 267, 270, 274, 278, 276, 265, 442, 458, 451, 595, 493…
## $ WD44 <int> 266, 244, 267, 272, 273, 278, 271, 267, 446, 464, 450, 595, 504…
## $ WD50 <int> 265, 238, 264, 274, 276, 284, 282, 273, 456, 475, 462, 612, 507…
## $ WD57 <int> 272, 247, 268, 273, 278, 279, 281, 274, 468, 484, 466, 618, 518…
## $ WD64 <int> 278, 245, 269, 275, 280, 281, 284, 278, 478, 496, 472, 628, 525…
head(RATS)
## ID Group WD1 WD8 WD15 WD22 WD29 WD36 WD43 WD44 WD50 WD57 WD64
## 1 1 1 240 250 255 260 262 258 266 266 265 272 278
## 2 2 1 225 230 230 232 240 240 243 244 238 247 245
## 3 3 1 245 250 250 255 262 265 267 267 264 268 269
## 4 4 1 260 255 255 265 265 268 270 272 274 273 275
## 5 5 1 255 260 255 270 270 273 274 273 276 278 280
## 6 6 1 260 265 270 275 275 277 278 278 284 279 281
str(RATS)
## 'data.frame': 16 obs. of 13 variables:
## $ ID : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Group: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
## $ WD1 : int 240 225 245 260 255 260 275 245 410 405 ...
## $ WD8 : int 250 230 250 255 260 265 275 255 415 420 ...
## $ WD15 : int 255 230 250 255 255 270 260 260 425 430 ...
## $ WD22 : int 260 232 255 265 270 275 270 268 428 440 ...
## $ WD29 : int 262 240 262 265 270 275 273 270 438 448 ...
## $ WD36 : int 258 240 265 268 273 277 274 265 443 460 ...
## $ WD43 : int 266 243 267 270 274 278 276 265 442 458 ...
## $ WD44 : int 266 244 267 272 273 278 271 267 446 464 ...
## $ WD50 : int 265 238 264 274 276 284 282 273 456 475 ...
## $ WD57 : int 272 247 268 273 278 279 281 274 468 484 ...
## $ WD64 : int 278 245 269 275 280 281 284 278 478 496 ...
summary(RATS)
## ID Group WD1 WD8 WD15
## 1 : 1 1:8 Min. :225.0 Min. :230.0 Min. :230.0
## 2 : 1 2:4 1st Qu.:252.5 1st Qu.:255.0 1st Qu.:255.0
## 3 : 1 3:4 Median :340.0 Median :345.0 Median :347.5
## 4 : 1 Mean :365.9 Mean :369.1 Mean :372.5
## 5 : 1 3rd Qu.:480.0 3rd Qu.:476.2 3rd Qu.:486.2
## 6 : 1 Max. :555.0 Max. :560.0 Max. :565.0
## (Other):10
## WD22 WD29 WD36 WD43
## Min. :232.0 Min. :240.0 Min. :240.0 Min. :243.0
## 1st Qu.:267.2 1st Qu.:268.8 1st Qu.:267.2 1st Qu.:269.2
## Median :351.5 Median :356.5 Median :360.0 Median :360.0
## Mean :379.2 Mean :383.9 Mean :387.0 Mean :386.0
## 3rd Qu.:492.5 3rd Qu.:497.8 3rd Qu.:504.2 3rd Qu.:501.0
## Max. :580.0 Max. :590.0 Max. :597.0 Max. :595.0
##
## WD44 WD50 WD57 WD64
## Min. :244.0 Min. :238.0 Min. :247.0 Min. :245.0
## 1st Qu.:270.0 1st Qu.:273.8 1st Qu.:273.8 1st Qu.:278.0
## Median :362.0 Median :370.0 Median :373.5 Median :378.0
## Mean :388.3 Mean :394.6 Mean :398.6 Mean :404.1
## 3rd Qu.:510.5 3rd Qu.:516.0 3rd Qu.:524.5 3rd Qu.:530.8
## Max. :595.0 Max. :612.0 Max. :618.0 Max. :628.0
##
glimpse(RATSL)
## Observations: 176
## Variables: 5
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3…
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1,…
## $ WD <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1",…
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 47…
## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8,…
head(RATSL)
## ID Group WD Weight Time
## 1 1 1 WD1 240 1
## 2 2 1 WD1 225 1
## 3 3 1 WD1 245 1
## 4 4 1 WD1 260 1
## 5 5 1 WD1 255 1
## 6 6 1 WD1 260 1
str(RATSL)
## 'data.frame': 176 obs. of 5 variables:
## $ ID : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
## $ WD : chr "WD1" "WD1" "WD1" "WD1" ...
## $ Weight: int 240 225 245 260 255 260 275 245 410 405 ...
## $ Time : int 1 1 1 1 1 1 1 1 1 1 ...
summary(RATSL)
## ID Group WD Weight Time
## 1 : 11 1:88 Length:176 Min. :225.0 Min. : 1.00
## 2 : 11 2:44 Class :character 1st Qu.:267.0 1st Qu.:15.00
## 3 : 11 3:44 Mode :character Median :344.5 Median :36.00
## 4 : 11 Mean :384.5 Mean :33.55
## 5 : 11 3rd Qu.:511.2 3rd Qu.:50.00
## 6 : 11 Max. :628.0 Max. :64.00
## (Other):110
From the wide dataset we can see that there are three groups: Group 1 has 8 rats, while Groups 2 and 3 have 4 rats each. Weight is measured at 11 timepoints roughly 7 days apart (WD1 to WD64). Overall, weight seems to increase over time.
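The wide-to-long conversion itself was done in the data wrangling script; as a reminder, a minimal sketch of the idea (on a toy stand-in for the wide RATS data, using tidyr's `pivot_longer`, the current equivalent of the older `gather`) looks like this:

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the wide RATS data (the real set has 16 rats and
# 11 measurement columns, WD1...WD64).
RATS <- data.frame(ID = factor(1:2), Group = factor(c(1, 2)),
                   WD1 = c(240, 410), WD8 = c(250, 415))

RATSL <- RATS %>%
  pivot_longer(cols = starts_with("WD"),
               names_to = "WD", values_to = "Weight") %>%
  mutate(Time = as.integer(substring(WD, 3))) %>%  # "WD1" -> 1
  arrange(Time)
```

Each rat then contributes one row per measurement day, which is the form the plotting and modelling below expect.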
ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=8)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))
RATSL <- RATSL %>%
group_by(Time) %>%
mutate( stdWeight = (Weight - mean(Weight))/sd(Weight) ) %>%
ungroup()
glimpse(RATSL)
## Observations: 176
## Variables: 6
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2…
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1,…
## $ WD <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD…
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555,…
## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8,…
## $ stdWeight <dbl> -1.0011429, -1.1203857, -0.9613953, -0.8421525, -0.8819001,…
ggplot(RATSL, aes(x = Time, y = stdWeight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=8)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(RATSL$stdWeight), max(RATSL$stdWeight)))
The graphs draw the weight of each individual rat within the different groups over time: timepoints on the x-axis, weight in grams on the y-axis, one linetype per individual, and one panel per group.
In the individual profiles we can clearly see that the baseline is much lower in group 1. Group 2 contains an outlier with a higher starting weight, which pulls the group mean up.
After standardization, the weight at each timepoint has mean 0 and standard deviation 1. Because the standardization is done within each timepoint (grouping by Time), the curves no longer show the increasing trend seen before.
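As a sanity check, per-timepoint standardization should leave every timepoint with mean 0 and standard deviation 1; a small base-R sketch on toy data (not the actual RATS set) demonstrates this:

```r
# Toy data: two timepoints, three weights each.
w <- data.frame(Time   = rep(c(1, 8), each = 3),
                Weight = c(240, 255, 270, 250, 265, 285))

# Standardize within each timepoint: (x - mean) / sd.
w$stdWeight <- ave(w$Weight, w$Time,
                   FUN = function(x) (x - mean(x)) / sd(x))

tapply(w$stdWeight, w$Time, mean)  # both timepoints ~0
tapply(w$stdWeight, w$Time, sd)    # both timepoints 1
```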
## Combining individuals to a group mean
RATSL2 <- RATSL
RATSL2 <- within(RATSL2, {Time <- factor(Time)})
RATSL2.aov <- aov(Weight ~ Group * Time + Error(ID), data = RATSL2)
summary(RATSL2.aov)
##
## Error: ID
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 2 2604050 1302025 88.07 2.76e-08 ***
## Residuals 13 192186 14784
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Error: Within
## Df Sum Sq Mean Sq F value Pr(>F)
## Time 10 23400 2340.0 58.838 < 2e-16 ***
## Group:Time 20 4906 245.3 6.168 2.88e-11 ***
## Residuals 130 5170 39.8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RATSS <- RATSL %>%
group_by(Group, Time) %>%
summarise(mean = mean(Weight), se = sd(Weight)/sqrt(n()) ) %>%  # n() = rats per group
ungroup()
glimpse(RATSS)
## Observations: 33
## Variables: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ Time <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29, 36,…
## $ mean <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 267.375, …
## $ se <dbl> 4.589478, 3.947710, 3.460116, 4.100800, 3.333956, 3.552939, 3.3…
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
geom_line() +
scale_linetype_manual(values = c(1:3)) +
geom_point(size=3) +
scale_shape_manual(values = c(1:3)) +
#geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
theme(legend.position = c(0.8,0.8)) +
scale_y_continuous(name = "mean(rats) +/- se(rats)")
The between-groups test indicates that the group variable is significant; accordingly, in the graph the lines for the three groups are rather far apart. The within-subject test indicates a significant time effect: the groups gain weight over time, and the slopes are increasing in all groups. Group 1 starts at a lower mean weight, and its increase over time seems smaller than in the other two groups.
The previous graphs already suggested that group 2 has an outlier; a boxplot shows this more clearly.
ggplot(RATSL2, aes(x = factor(Time), y = Weight, fill = Group)) +
geom_boxplot(position = position_dodge(width = 0.9)) +
scale_x_discrete(name = "Time")
RATSL8S <- RATSL %>%
filter(Time > 0) %>%
group_by(Group, ID) %>%
summarise(mean=mean(Weight) ) %>%
ungroup()
Glimpse the data
glimpse(RATSL8S)
## Observations: 16
## Variables: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean <dbl> 261.0909, 237.6364, 260.1818, 266.5455, 269.4545, 274.7273, 274…
ggplot(RATSL8S, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(Weight), Days")
The boxplot shows the group 2 outlier clearly. Groups 1 and 3 also each contain a single outlier, but those follow the trend of their group means. For science's sake, let us filter the outliers out of the groups.
RATSL8S1 <- RATSL8S %>%
filter(mean < 550) %>%
filter(mean > 240)
RATSL8SG3 <- subset(RATSL8S, Group == 3, select = c(Group, ID, mean)) %>%
filter(mean > 500)
RATSL8SG12 <- subset(RATSL8S1, Group %in% 1:2, select = c(Group, ID, mean))
RATSL8S1 <- rbind(RATSL8SG12, RATSL8SG3)
ggplot(RATSL8S1, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(Weight), Days")
Without the outliers, the group means are clearly different.
## ANOVA
Next we use ANOVA for the analysis of variance; a t-test can't be used here because we are comparing three groups instead of two.
fit1 <- lm(mean ~ Group, data = RATSL8S1)
summary(fit1)
##
## Call:
## lm(formula = mean ~ Group, data = RATSL8S1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.0000 -2.9394 -0.4848 3.4242 7.7727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 266.955 2.999 89.02 1.35e-10 ***
## Group2 180.864 5.194 34.82 3.74e-08 ***
## Group3 269.803 4.581 58.90 1.61e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.997 on 6 degrees of freedom
## Multiple R-squared: 0.9984, Adjusted R-squared: 0.9978
## F-statistic: 1827 on 2 and 6 DF, p-value: 4.408e-09
anova(fit1)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 2 131409 65704 1826.7 4.408e-09 ***
## Residuals 6 216 36
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here we can see that the difference between the group means is significant.
Does the difference remain if the baseline weight (WD1) is added to the model as a covariate?
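ANOVA only tells us that at least one group mean differs; to see which pairs differ, a post-hoc test such as Tukey's HSD could be used. A sketch on toy data (not part of the original analysis) roughly mimicking the mean-weight levels:

```r
# Toy data: three groups at clearly different mean-weight levels.
set.seed(1)
toy <- data.frame(Group = factor(rep(1:3, each = 5)),
                  mean  = c(rnorm(5, 265, 5),
                            rnorm(5, 450, 5),
                            rnorm(5, 530, 5)))

toy.aov <- aov(mean ~ Group, data = toy)
TukeyHSD(toy.aov)  # pairwise differences with adjusted p-values
```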
baseline <- RATS$WD1
RATSL8S2 <- RATSL8S %>%
mutate(baseline)
fit2 <- lm(mean ~ baseline + Group, data = RATSL8S2)
summary(fit2)
##
## Call:
## lm(formula = mean ~ baseline + Group, data = RATSL8S2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.732 -3.812 1.991 6.889 13.455
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.14886 19.88779 1.516 0.1554
## baseline 0.93194 0.07793 11.959 5.02e-08 ***
## Group2 31.68866 17.11189 1.852 0.0888 .
## Group3 21.52296 21.13931 1.018 0.3287
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.62 on 12 degrees of freedom
## Multiple R-squared: 0.9947, Adjusted R-squared: 0.9933
## F-statistic: 747.8 on 3 and 12 DF, p-value: 6.636e-14
anova(fit2)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## baseline 1 252125 252125 2237.0655 5.217e-15 ***
## Group 2 726 363 3.2219 0.07586 .
## Residuals 12 1352 113
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The significance was lost, so the baseline weight should be taken into account when assigning the treatment groups.
## Season's greetings from a mental institute
The other dataset contains brief psychiatric rating scale (BPRS) measurements of 40 males, taken weekly for eight weeks. The subjects were randomly assigned to two treatment groups.
The wide and long form comparison was already done at the beginning of this exercise. All 40 subjects have 9 measurements, so the long data set contains 360 rows.
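One detail worth noting: the subject IDs run from 1 to 20 within each treatment arm, so the same label appears in both groups (the mixed models below accordingly report 20 subject groups, and the plotting code distinguishes individuals with `interaction(treatment, subject)`). The same trick builds a truly unique identifier; a sketch on a toy stand-in for the long data:

```r
# Toy stand-in for the long BPRS data: two subjects per treatment,
# with IDs reused across treatments (as in the real data).
BPRSL <- data.frame(treatment = factor(rep(1:2, each = 2)),
                    subject   = factor(c(1, 2, 1, 2)),
                    bprs      = c(42, 58, 50, 49))

# interaction() combines the two factors into one per-individual id.
BPRSL$uid <- interaction(BPRSL$treatment, BPRSL$subject)
nlevels(BPRSL$uid)  # 4 individuals, not 2
```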
glimpse(BPRSL)
## Observations: 360
## Variables: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ weeks <chr> "week0", "week0", "week0", "week0", "week0", "week0", "week…
## $ bprs <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66,…
## $ week <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
str(BPRS)
## 'data.frame': 40 obs. of 11 variables:
## $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ week0 : int 42 58 54 55 72 48 71 30 41 57 ...
## $ week1 : int 36 68 55 77 75 43 61 36 43 51 ...
## $ week2 : int 36 61 41 49 72 41 47 38 39 51 ...
## $ week3 : int 43 55 38 54 65 38 30 38 35 55 ...
## $ week4 : int 41 43 43 56 50 36 27 31 28 53 ...
## $ week5 : int 40 34 28 50 39 29 40 26 22 43 ...
## $ week6 : int 38 28 29 47 32 33 30 26 20 43 ...
## $ week7 : int 47 28 25 42 38 27 31 25 23 39 ...
## $ week8 : int 51 28 24 46 32 25 31 24 21 32 ...
ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)))
There seems to be a declining trend in both groups, but there is quite a lot of noise. We need to build a model to analyse whether the treatments differ, using the BPRS score as the response variable.
### First, a linear model
fitbprs <- lm(bprs ~ treatment, data = BPRSL)
summary(fitbprs)
##
## Call:
## lm(formula = bprs ~ treatment, data = BPRSL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.944 -10.372 -2.372 5.628 57.056
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.3722 1.0193 36.663 <2e-16 ***
## treatment2 0.5722 1.4416 0.397 0.692
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.68 on 358 degrees of freedom
## Multiple R-squared: 0.0004399, Adjusted R-squared: -0.002352
## F-statistic: 0.1576 on 1 and 358 DF, p-value: 0.6916
This model assumes that the repeated measures are independent, which does not hold for measurements taken from the same individuals. We therefore need a more appropriate model, such as the random intercept model: it uses the two explanatory variables (week and treatment) and lets the fitted intercept differ between individuals. The BPRS score does not seem to be associated with the treatment.
### The random intercept model
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2748.7 2768.1 -1369.4 2738.7 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0481 -0.6749 -0.1361 0.4813 3.4855
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 47.41 6.885
## Residual 104.21 10.208
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 1.9090 24.334
## week -2.2704 0.2084 -10.896
## treatment2 0.5722 1.0761 0.532
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.437
## treatment2 -0.282 0.000
The result is of the same type: the treatment effect is still not significant.
### Random intercept and random slope model
BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2745.4 2772.6 -1365.7 2731.4 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8919 -0.6194 -0.0691 0.5531 3.7977
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.8222 8.0512
## week 0.9609 0.9803 -0.51
## Residual 97.4304 9.8707
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.1052 22.066
## week -2.2704 0.2977 -7.626
## treatment2 0.5722 1.0405 0.550
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.582
## treatment2 -0.247 0.000
anova(BPRS_ref1, BPRS_ref)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## BPRS_ref 5 2748.7 2768.1 -1369.4 2738.7
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The model itself doesn't bring much new information, but according to the ANOVA comparison the random slope model fits the data significantly better (p = 0.026).
### Random intercept and random slope model with interaction
BPRS_ref2 <- lmer(bprs ~ week * treatment + (week | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week * treatment + (week | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2744.3 2775.4 -1364.1 2728.3 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0512 -0.6271 -0.0768 0.5288 3.9260
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.9964 8.0620
## week 0.9687 0.9842 -0.51
## Residual 96.4707 9.8220
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.8856 2.2521 21.262
## week -2.6283 0.3589 -7.323
## treatment2 -2.2911 1.9090 -1.200
## week:treatment2 0.7158 0.4010 1.785
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.650
## treatment2 -0.424 0.469
## wek:trtmnt2 0.356 -0.559 -0.840
anova(BPRS_ref1, BPRS_ref2)
## Data: BPRSL
## Models:
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## BPRS_ref2: bprs ~ week * treatment + (week | subject)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4
## BPRS_ref2 8 2744.3 2775.4 -1364.1 2728.3 3.1712 1 0.07495 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
p1 <- ggplot(BPRSL, aes(x = week, y = bprs, group=interaction(treatment, subject))) +
scale_x_continuous(name = "Week", breaks = seq(0, 60, 20)) +
scale_y_continuous(name = "bprs")
p1 + geom_line()
Fitted <- fitted(BPRS_ref2)
BPRSL <- BPRSL %>%
mutate(Fitted)
p2 <- ggplot(BPRSL, aes(x = week, y = Fitted, group=interaction(treatment, subject))) +
scale_x_continuous(name = "Week", breaks = seq(0, 60, 20)) +
scale_y_continuous(name = "bprs")
p2 + geom_line()
The random intercept and random slope model with interaction seems to describe the observations well: the fitted profiles resemble the observed ones.